- 
        Couldn't load subscription status. 
- Fork 25.6k
Ensure ivf postings lists are in docID order #129655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Pinging @elastic/es-search-relevance (Team:Search Relevance) | 
| I did some benchmarking, it doesn't give us much space savings (yet), but it didn't hurt performance. | 
| The idea for not sorting by docId was to favour having vectors together that were not SOAR vectors so bulk scoring is | 
| I need to benchmark this in highly filtered scenarios (e.g. when we will search more centroids), to ensure this doesn't hurt search performance. | 
| I ran some higher-recall filtered search scenarios and there is basically zero increase in query latency. baseline: this PR:  | 
| // keeping them in the same file indicates we pull the entire file into cache | ||
| docIdsWriter.writeDocIds(j -> floatVectorValues.ordToDoc(cluster[j]), size, postingsOutput); | ||
| postingsOutput.writeGroupVInts(docIds, size); | ||
| postingsOutput.writeGroupVInts(spillDocIds, overspillCluster.length); | ||
| onHeapQuantizedVectors.reset(centroid, size, j -> cluster[finalOrds[j]]); | ||
| bulkWriter.writeVectors(onHeapQuantizedVectors); | ||
| // write overspill vectors | ||
| onHeapQuantizedVectors.reset(centroid, overspillCluster.length, j -> overspillCluster[finalSpillOrds[j]]); | ||
| bulkWriter.writeVectors(onHeapQuantizedVectors); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sort of what you had in mind @iverase ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. My idea is that we can use this information to first score the assignments of all the clusters we want to visit so we can ensure that the posting lists will be unique and have simpler (and faster) visiting logic, and later visit the spill assignments where we would have a more complex (slower) logic to remove already visited posting lists. The downside of this approach is that it will require more hops on the posting list files breaking a bit the disk friendly approach of this type of index.
Do you think this is something doable? it complicated the search logic quite a bit and maybe the benefits are too small. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iverase the main benefit of SOAR and overspilling in general is that fewer nProbe need to be gathered. I would expect us to score both regular and overspill up to some fraction of nprobe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not saying that we would not score both but doing one after the other so if there is not deletes, scoring unique posting lists would be faster (no need to process docIds before scoring them). I can see it gets hairy and it would require different logic branches which is not great.
If you see no performance impact, I think we can just order all docs. We can make the distinction later on if is ever required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we already have all the vectors are already in doc order. But, when we combine initial and secondary assignments into one grouping, that is when you get a partial ordering.
Each assignment array is already in vector ordinal order, which also means that they are already in doc Id order.
This PR now just keeps them separate (no sorting required).
Overall, there don't seem to be significant performance gains.
I noticed slightly lower performance when no filters are provided.
I noticed higher performance with very restrictive filters are used with high-nprobe, but I don't really know why.
I wouldn't expect doc ID decoding to be a significant issue.
I am gonna leave this as it is and we can revisit it at a later time, unless you have a better intuition around this.
…csearch into ivf/ensure-doc-id-order
| done elsewhere | 
This PR is pretty basic, right now we don't enforce any ordering at all for our IVF postings lists.
It seems like we should at a minimum make sure they are in doc-id order.
If we decide to switch this in the future, at least we will have a consistent ordering.